1 Introduction

The overall objective of the URI project was to provide users of high-performance parallel and
vector  machines  with  an  environment  that  makes  the  task  of  programming  easier.  The  environment 
unifies  a  set  of  tools.  Many  of  these  tools  are  specific  to  the  problem  of  parallel  programming,  such 
as  restructuring  compilers,  parallel  debuggers,  performance  evaluation  tools,  data  visualization  and 
computational  steering  tools. 

The design takes into consideration the entire process of scientific/engineering project development.
This process can be characterized as a cycle of distinct transformations applied to a practical
problem. Beginning with a description or definition of a physical problem, the user must design and
build a source program, debug it, optimize it, and finally execute it to review its results. Figure 1
illustrates  the  five  phases  of  this  cyclic  process.  The  ultimate  goal  of  the  project  was  to  provide 
tools  to  aid  users  in  all  steps  along  the  way  during  this  transformation  process. 

Referring  to  Figure  1,  the  user  enters  the  loop  at  Phase  1,  the  physical  problem  domain.  Here 
the  user  thinks  about  an  application  in  the  context  of  a  particular  problem  domain.  The  first  step  in 
the  solution  process  is  the  transformation  of  the  user’s  problem  from  the  physical  problem  domain 
to  the  source  code  domain  (Phase  2).  In  the  source  code  domain,  the  problem  is  expressed  as  a 
sequence  of  algorithms  encoded  in  a  particular  programming  language,  frequently  Fortran  or  C. 

Once  the  problem  has  been  mapped  into  a  source  code  representation,  the  user  can  begin  to 
evolve  it  toward  a  running  program.  This  is  the  internal  program  domain  (Phase  3).  The  transition 
to  this  internal  space  requires  several  tools.  First,  source  language  compilers  are  needed  to  correct 
and  translate  the  source  code  “interpretations”  of  the  physical  problem  into  a  form  that  coincides 
with  the  traditional  programming  paradigm.  The  internal  program  domain  is  still  characterized 
by  its  machine  independence.  Although  the  original  physical  problem  has  been  rendered  into  a 
form  that  can  be  executed  on  a  specific  hardware  platform,  no  machine-dependent  aspects  of  the 
implementation  have  yet  entered  the  design. 

In  the  internal  program  domain,  the  user  can  also  investigate  various  machine-independent 
aspects  of  the  application  that  pertain  to  parallelization.  Using  the  data  dependence  and  control 
dependence  information  that  exists  in  this  space,  the  user  may  apply  interactive  and  automatic 
restructuring  tools  to  the  application  for  the  purpose  of  improving  the  potential  performance  of  the 
program.  We  provided  several  tools  that  operate  in  this  domain.  The  user  may  invoke  automatic 
restructuring  compilers  to  operate  on  applications.  Interactive  tools  such  as  the  Sigma  editor  allow 
the user to peruse the application manually while querying the system about various aspects. The
compiler MIPRAC, in addition to generating parallel code for programs written in Fortran, C, and
Lisp, can provide users with detailed insight into the dependence relationships within the code.

Figure 1: Program Development Cycle

From  the  internal  program  domain,  the  application  is  transformed  into  the  program  data  domain 
(Phase  4).  This  domain  is  characterized  primarily  by  the  existence  of  debugging  tools  that  allow  the 
user to validate the integrity of an application by verification of program state at various stages of
its  execution.  This  internal  state  is  composed  of  data  elements;  that  is,  the  condition  of  the  program 
is  evaluated  with  respect  to  the  value  of  all  of  the  internal  data  structures  on  which  it  is  operating. 
The  debugging  tool  MDB,  an  efficient  machine-level  debugger  for  the  Cedar  multiprocessor,  was 
developed  as  part  of  this  effort. 

Once  a  program  is  running  correctly,  development  moves  into  the  physical  machine  domain 
(Phase 5), and performance improvement can begin. This involves not only gathering and presenting
performance data, but also interpreting this data. The improvement process is characterized
by program profiling, program tuning, and performance prediction. The user’s conceptual view
of  a  parallel  program’s  operation  often  differs  substantially  from  the  actual  execution  behavior. 
To  understand  why  a  program  might  be  running  slowly,  a  global  view  of  performance  behavior  is 
needed.  New  performance  visualization  tools  help  the  user  reconcile  these  differing  views.  The 
user may examine execution traces with TraceView or may peruse trace-based profiling information
using  CPROF  and  SuperVu. 

Program tuning requires finer control and additional performance detail (e.g., loop or statement
flow). The coupling of performance tools with program restructuring software permits automatic
instrumentation and display of performance data, to identify those portions of the code suitable for
additional performance tuning.

Once  the  program  achieves  an  acceptable  level  of  performance,  the  user  can  execute  it  and 
evaluate  its  results.  Because  the  program  was  originally  developed  to  help  a  scientist  or  engineer 
with  a  particular  real-world  problem,  there  is  much  current  work  on  rendering  the  results  in  a 
manner  that  resembles  the  original  problem  domain.  We  developed  a  series  of  visualization  tools 
for  output  rendering  that  close  the  loop  in  the  program  development  process.  VISTA,  and  its 
successor  VASE,  provide  a  basis  for  distributed  execution  and  visualization  of  applications.  VASE 
also  provides  a  sophisticated  framework  for  steering  applications  during  execution,  allowing  the 
user  to  adjust  problem  parameters  or  examine  data  structures  in  depth  without  a  tedious  and 
expensive  edit-compile-execute  cycle. 

The  phases  described  above  overlap  fairly  significantly;  therefore,  tools  used  in  one  phase 
are  often  useful  in  others.  For  example,  the  steering  and  visualization  capabilities  provided  by 
VASE  are  valuable  in  achieving  correct  execution  on  a  particular  machine  (Phase  4),  optimizing 
its  performance  for  a  particular  dataset  (Phase  5),  and  experimenting  with  alternative  algorithms 
(Phase  2). 

The  tools  required  to  assist  in  the  overall  program  development  cycle  can  be  categorized  as 
follows: 

• overall environment tools (Phase 1 → 2)

• compilers and program structure analysis/query tools (Phase 2 → 3)

• debugging and data visualization tools (Phase 3 → 4)

• performance instrumentation and performance visualization tools (Phase 4 → 5)

The  work  supported  by  this  grant  is  an  extension  of  the  original  Faust  project,  which  was 
supported by AFOSR grant F49620-86-C-0136. With the support of the first grant, a set of X-based
environment tools (including a graph manager, text manager, and a comprehensive project manager)
was developed to form the foundation for a portable, integrated environment. However, during
the same period, outside research and development efforts, notably new user interface toolkits for
version 11 of the X Window System, were gaining widespread acceptance as portable standard
interfaces. Recognizing that the goals of portability and integration could be met by adopting this
industry-standard platform, the work supported by this follow-on grant focused principally upon
tool development within an X11 framework. The remaining sections of this report describe the
accomplishments  in  these  areas. 

2  Compiler  and  Program  Structure  Tools 

Compilers transform source code into object code. Thus, referring back to Figure 1, compiler
tools move the user from Phase 2 to Phase 3. Restructuring compilers for shared-memory parallel
processors are the subject of considerable research at the Center for Supercomputing Research and
Development. Several avenues of research were pursued as part of this project.

2.1 Sigma (D. Gannon)


The Sigma system was one of the original components of the Faust project [1]. In its initial form
it was an integrated tool for interactive restructuring of Fortran programs [2]. In the current phase
of the AFOSR contract, the system was redesigned to provide basic compiler technology support
for a wide variety of projects. More specifically, the core elements of Sigma consisted of modules
that supported the complete syntactic and semantic analysis of Fortran and C programs.
It was decided that the second version of Sigma, called Sigma II, should provide this support
to a wider variety of tools that need to understand C and Fortran programs. Sigma
was therefore redesigned as a portable “tool kit” for building interactive and static program analysis
systems.

The intended use of Sigma II is building end-user tools to aid in the process of program
parallelization and performance analysis. Simply put, Sigma II is a data base for accessing and
manipulating program control and data dependence information. The interface to the end-user tool
consists of over 100 C functions that can be used to access and modify the data base.

Sigma II supports multiple source languages. Input to the analysis data
base can be Fortran 77, Fortran 90 (some support for PCF Fortran is also provided), C,
C++, or pC++ (an extension to C++ for a variation on data parallel programming that we call
“object parallel” computation [3]). While this does not cover all the languages used for parallel
programming, it does cover the majority of areas where large applications exist.

While  there  is  little  support  for  extensions  to  standard  languages,  Sigma  II  provides  support 
for  a  very  powerful  annotation  and  directive  system  that  lets  users  add  source  code  comments  that 
can be passed to the tool. These annotations can be used for data distribution directives, program
instrumentation  commands  or  restructuring  directives. 
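
As a rough illustration of how comment-based annotations might reach a tool, the following sketch scans Fortran source for comment lines that carry a directive. The `C$ANN` prefix, the directive keywords, and the names `Annotation` and `extract_annotations` are hypothetical, chosen only for this example; the report does not specify the actual Sigma II annotation syntax.

```python
import re
from typing import List, NamedTuple


class Annotation(NamedTuple):
    line: int   # 1-based line number of the comment in the source text
    kind: str   # directive keyword, upper-cased (hypothetical vocabulary)
    text: str   # remainder of the directive, left uninterpreted


# Hypothetical syntax: a fixed-form Fortran comment line beginning "C$ANN".
_PATTERN = re.compile(r"^C\$ANN\s+(\w+)\s*(.*)$", re.IGNORECASE)


def extract_annotations(source: str) -> List[Annotation]:
    """Return every annotation comment found in the source, in file order."""
    found = []
    for number, line in enumerate(source.splitlines(), start=1):
        match = _PATTERN.match(line)
        if match:
            found.append(
                Annotation(number, match.group(1).upper(), match.group(2).strip())
            )
    return found
```

Because the directives live in comments, a standard compiler ignores them, while a tool can dispatch on `kind` to treat a line as a data distribution, instrumentation, or restructuring request.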

The  functions  provided  by  the  data  base  can  be  used  by  the  end-user  tool  in  one  of  three  ways. 

•  Extracting  syntactic  information.  This  includes  symbol  and  type  table  information,  program 
control  flow  structure,  user  annotation  and  directive  information,  and  access  to  the  complete 
parse  tree  of  the  application. 

• Extracting semantic information. This includes interprocedural definition and use summaries
for each function and procedure, data dependence analysis, scalar propagation,
and symbolic analysis and simplification of scalar expressions.

• Restructuring the program by modifying the data base contents and generating (unparsing)
new versions of the source code based on the modifications.

Parallel programming tools are designed to work with user application programs, which are often
very large and exist as multiple source files. In Sigma terms, each application program defines a
project, which consists of a set of source files, a set of dependence files, and a project file. For each
source file there is a corresponding dependence file generated by the parser. The dependence file
for each source file contains a complete parse tree and symbol table for the program fragment in
that file. In addition, it contains a first-pass analysis of the interprocedural flow for the functions
and subroutines in that module.
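
The project organization described above can be pictured with a small data model. This is only a sketch under stated assumptions: the class names, field names, and the `callers_of` query are invented for illustration and do not correspond to the Sigma function library, whose interface is a set of C functions not listed in this report.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DependenceFile:
    """Per-source-file record produced by the parser (hypothetical layout)."""
    source: str                                            # source file name
    parse_tree: dict = field(default_factory=dict)         # placeholder for the tree
    symbol_table: Dict[str, str] = field(default_factory=dict)
    # First-pass interprocedural summary: routine name -> routines it calls.
    call_summary: Dict[str, List[str]] = field(default_factory=dict)


@dataclass
class Project:
    """A project groups the dependence files of one application."""
    name: str
    dependence_files: Dict[str, DependenceFile] = field(default_factory=dict)

    def add_source(self, path: str) -> DependenceFile:
        """Register a source file and return its (initially empty) dependence file."""
        dep = DependenceFile(source=path)
        self.dependence_files[path] = dep
        return dep

    def callers_of(self, routine: str) -> List[str]:
        """Combine the per-file summaries to find every caller of a routine."""
        callers = set()
        for dep in self.dependence_files.values():
            for caller, callees in dep.call_summary.items():
                if routine in callees:
                    callers.add(caller)
        return sorted(callers)
```

The point of the sketch is the division of labor: each dependence file is produced independently from one source file, and whole-program questions are answered by combining the per-file summaries.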

The  dependence  files  for  an  application  constitute  the  data  base.  Each  dependence  file  is  a 
collection  of  graphs  and  tables  corresponding  to  the  structured  parse  tree  for  the  program.  The  tool 
builder can access and modify the information in the data base by means of the Sigma function
library.

Examples  of  the  types  of  tools  that  Sigma  was  designed  to  support  include  the  following. 

• Interactive program restructuring tools to aid in exploiting parallelism. A prototype of
this application of Sigma was demonstrated in the SigmaCS system, which was based on an
extension of the popular EMACS program editor [4]. Another component of the program
restructuring effort involved a detailed study of the way cache memory behavior can be
improved by subtle program restructuring. This work was published in several conference
proceedings but is best summarized in [5] and the Ph.D. thesis [6].

• Performance analysis tools that incorporate the structure and semantics of the application
into the user’s view of the computation. This has been a very fruitful area of Sigma application.
Our initial experiments with extending Sigma for performance analysis focused on using the
cache studies cited above to predict program behavior [7], which culminated in the Ph.D.
thesis [8]. A second performance analysis project used Sigma as a component of a
temporal relational algebra for a “Performance Spreadsheet”. This project, Sieve.1 [9, 10], is
a performance evaluation “spreadsheet” that allows users to explore parallel
program event trace files interactively and ask “what if...?” questions about possible parallel
optimizations. Sekhar Sarukkai, who has just completed his Ph.D. thesis [11], directed this work.
Another student, Suresh Srinivas, is continuing this work but with a greater focus on performance
animation. Additional performance analysis work supported by Sigma tools was a
study of network traffic on the BBN series of machines [12].

•  Scientific  visualization  and  computational  steering  tools  that  provide  a  way  for  the  user  to  see 
application  data  structures  and  to  modify  the  progress  of  the  computation  without  requiring 
the user to modify the code. This area is the most advanced application of tool kits like
Sigma.  The  interactive  steering  project  described  in  this  report  represents  the  state  of  the  art. 

• Parallel programming language design and implementation. Our work in this area has
focused on parallel extensions of the C and C++ programming languages. The first effort was
an experiment in vectorizing C [13]. Over the past two years we have focused on a Sigma-based
tool that translates a dialect of C++ to run on multiprocessor systems [3, 14-17]; this work
also produced a Ph.D. thesis [18]. This effort has grown into a separate DARPA-sponsored
endeavor to define a standard for “High Performance C++”.

The first version of the Sigma system is now complete and has been distributed to many
sites. The list of places where Sigma has been installed and used in research projects includes
CSRD, Indiana University, Cornell University, Los Alamos, University of Colorado, University
of California San Diego, Rice University, University of Wisconsin, University of Washington,
Australian National University, University of Michigan, NASA Ames, University of Edinburgh,
University of Rennes, University of Southampton, and the University of Delft. It is likely that it is
in use at other institutions, but this list contains only those sites with which we are engaged in direct
and ongoing communications.

Sigma II has proven to be of substantial value in the construction of a number of parallel
programming tools. Because the toolkit gives researchers an easy way to access
Fortran 90, C, and C++ application syntax and semantics, we feel new ideas can be more quickly
tested and production systems can be more easily built.

The complete source code, documentation, and example applications are available by anonymous
FTP. However, it must be stated that the system is still evolving. Our Fortran 90 front end
has not been extensively tested because, at the time of this writing, there are still not very many
programs that use the full set of features of Fortran 90 (such as modules, operator overloading, and
pointers). As our test set becomes larger, we will eliminate existing bugs.

A summary of the project has now been published in [19]. A follow-up project to redesign
Sigma to support High Performance Fortran and HPC++ has recently started and is supported by
DARPA [20].

2.2  Restructuring  Functional  and  Logic  Programs  (D.  Padua  and  D.  Sehr) 

We have applied a collection of transformations developed for Fortran programs to the
parallelization of functional and logic programs. The overall objective was the development of compiler
techniques for the parallelization of Prolog programs. As functional programs comprise an
interesting subset of logic programs, a number of our techniques were applied to functional programs.
These  transformations  were  then  extended  to  the  full  Prolog  language,  and  a  complete  framework 
for  the  parallelization  of  Prolog  has  been  developed. 

Our compilation techniques were shown to be capable of all the transformations of functional
programs performed by Darlington’s program synthesis technique, and are briefly described in a
paper that appeared in the 1991 International Conference on Parallel Processing [21]. Moreover,
our techniques are based upon the dependence graph framework developed for Fortran
parallelization, and hence should be easily extensible to include new transformations, scheduling, and
synchronization algorithms.

Having  developed  the  techniques  for  functional  programs,  we  extended  them  to  logic  languages, 
particularly  Prolog.  As  the  control  flow  of  logic  programs  is  very  difficult  to  determine  syntactically, 
our  main  effort  was  to  develop  a  control  flow  framework  in  order  to  construct  dependence  graphs. 
Our techniques bring together a number of Prolog and Fortran techniques to expose loops in
recursive  procedures,  and  to  analyze  their  dependences.  In  particular,  abstract  interpretation  is 
used  to  obtain  call/success  modes  and  aliasing,  which  are  used  to  refine  the  flow  graph.  Induction 
variable  analysis  and  linear  constraint  solving  expose  loop  control  variables,  which  are  then  used 
to  convert  the  program  to  a  form  containing  loops.  After  such  transformation,  these  programs  can 
be  processed  by  most  Fortran  techniques.  A  technical  report  describing  these  transformations  was 
produced in October 1992 [22], and a paper describing them will be submitted to the 1993 ACM
Conference  on  Supercomputing. 
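
The effect of exposing a loop in a recursive procedure can be illustrated with a deliberately simple example. The sketch below is written in Python rather than Prolog and is not output of the compiler described here; the function names are invented. The recursive form walks a list one position per call, as a Prolog predicate over a list would; the loop form makes the position `i` an explicit induction variable.

```python
from typing import List


def scale_recursive(xs: List[float], k: float, i: int = 0) -> List[float]:
    """Recursive form: one call per list position, as in a recursive predicate."""
    if i == len(xs):                 # base case: nothing left to process
        return []
    return [k * xs[i]] + scale_recursive(xs, k, i + 1)


def scale_loop(xs: List[float], k: float) -> List[float]:
    """Loop form: i is the induction variable recovered from the recursion."""
    out = [0.0] * len(xs)
    for i in range(len(xs)):         # each iteration touches only out[i] and xs[i]
        out[i] = k * xs[i]
    return out
```

Once the computation is in loop form, it is visible that no iteration depends on another, which is the kind of fact that dependence analysis developed for Fortran loops can establish and exploit.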

To assess the effectiveness of these transformations, a set of instrumentation programs was
developed that measures inherent parallelism in Prolog programs. These programs measure
critical  path  times  according  to  two  parallelism  models,  OR  and  AND/OR,  and  have  been  extended 
to  include  the  parallel  loop  forms  used  in  the  above  transformations.  A  paper  describing  these 
transformations  appeared  in  the  1992  International  Conference  on  Fifth  Generation  Computer 
Systems [23], and a revised and expanded version is to appear in New Generation Computing early
in  1993. 
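
A critical path time bounds how fast a program can run with unlimited processors, and the ratio of total work to critical path time is one common estimate of inherent parallelism. The following sketch computes both for a generic acyclic task graph. It is an assumption-laden illustration: the task-graph representation and function names are invented, and it does not model the OR and AND/OR parallelism models used by the instrumentation programs.

```python
from typing import Dict, List


def critical_path(cost: Dict[str, float], deps: Dict[str, List[str]]) -> float:
    """Length of the longest cost-weighted dependence chain in an acyclic graph."""
    finish: Dict[str, float] = {}

    def finish_time(task: str) -> float:
        # A task can start only when all of its predecessors have finished.
        if task not in finish:
            start = max((finish_time(d) for d in deps.get(task, [])), default=0.0)
            finish[task] = start + cost[task]
        return finish[task]

    return max((finish_time(t) for t in cost), default=0.0)


def inherent_parallelism(cost: Dict[str, float], deps: Dict[str, List[str]]) -> float:
    """Total work divided by critical path time (0.0 for an empty graph)."""
    path = critical_path(cost, deps)
    return sum(cost.values()) / path if path > 0 else 0.0
```

For a diamond-shaped graph in which task `a` precedes `b` and `c`, which both precede `d`, the two middle tasks can overlap, so the critical path is shorter than the total work.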
2.3 Compilation of Symbolic Computations (W. L. Harrison)

We  have  investigated  the  problem  of  compiling  non-numerical  programs  for  parallel  machines. 
This  effort  considered  the  problems  that  pointer  manipulations  (dynamic  allocation,  casting,  pointer 
arithmetic, etc.) pose for a parallelizing compiler. Miprac is the vehicle for this research.

Miprac is an umbrella project, under which several smaller projects are being conducted. Miprac
itself is a parallelizer and compiler that operates on a low-level intermediate form called MIL, and
that has especially powerful interprocedural analysis of pointers and first-class procedures. The
interprocedural analysis in Miprac is a whole-program abstract interpretation. It provides the
compiler with precise side-effect, dependence, and object lifetime information that is used for
extraction of parallelism and management of memory. Miprac is described in [24].

Although  work  on  the  compiler  itself  has  been  discontinued,  several  projects  are  carrying 
forward  the  technology  that  was  introduced  in  Miprac.  We  have  created  a  software  system  called 
Z1  that  is  used  to  create  program  analyses  automatically  from  high-level  specifications.  To  build  a 
compile-time analysis of reference counts, for example, one writes an operational semantics of MIL
that  counts  references,  and  Z1  turns  this  semantics  into  a  C  program  that  performs  the  analysis. 
The  analysis  may  be  tuned  (trading  accuracy  away  to  gain  efficiency)  by  a  simple  mechanism 
we  call  projection  expressions.  Z1  is  available  by  anonymous  FTP  from  CSRD,  and  is  described 
in  [25,26]. 

A second project that is proceeding under the umbrella of Miprac is concerned with the compile-time
analysis of explicitly parallel programs. Such programs can neither be analyzed nor optimized
by traditional methods because they are nondeterministic. We have developed extensions
to Miprac’s interprocedural analyses that permit the analysis of side-effects, dependences, and object
lifetimes in such programs. Using these methods, we have successfully optimized some explicitly
parallel programs that contain interesting synchronization. (The transformations that can be applied
to such programs are often highly counter-intuitive.) This work is to be applied in compilers which
are  designed  to  accept  a  program  that  has  been  parallelized  by  a  programmer,  but  which  is  to 
be  further  optimized,  either  for  the  extraction  of  additional  parallelism,  for  the  management  of 
complex  memory  resources,  or  simply  for  the  sake  of  improving  the  running  time  of  sequential 
sections  of  the  code  (as  mentioned  above,  even  ordinary  optimizations  cannot  be  applied  safely  to 
such  programs  without  such  an  analysis).  This  work  is  described  in  [27]. 
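
A small experiment shows why an ordinary sequential optimization can be unsafe in an explicitly parallel program. The sketch below is not Miprac's analysis; it simply enumerates every interleaving of two short threads, assuming each operation is atomic and memory is sequentially consistent, and all names are invented. One thread writes `data` and then raises `flag`; the other reads `flag` and then `data`. Swapping the two writes is harmless in a sequential program, since they touch different variables, but it changes the set of results the parallel program can produce.

```python
from itertools import combinations
from typing import Callable, Dict, List, Set, Tuple

State = Dict[str, int]
Op = Callable[[State], None]


def write(var: str, value: int) -> Op:
    """Operation that stores a constant into a shared variable."""
    def op(state: State) -> None:
        state[var] = value
    return op


def read(reg: str, var: str) -> Op:
    """Operation that copies a shared variable into a thread-local register."""
    def op(state: State) -> None:
        state[reg] = state[var]
    return op


def outcomes(thread_a: List[Op], thread_b: List[Op]) -> Set[Tuple[int, int]]:
    """All (r1, r2) results over every interleaving of the two threads."""
    n, m = len(thread_a), len(thread_b)
    results = set()
    # Choose which of the n + m time slots are taken by thread A.
    for a_slots in combinations(range(n + m), n):
        state = {"data": 0, "flag": 0, "r1": 0, "r2": 0}
        a_ops, b_ops = iter(thread_a), iter(thread_b)
        for step in range(n + m):
            op = next(a_ops) if step in a_slots else next(b_ops)
            op(state)
        results.add((state["r1"], state["r2"]))
    return results


producer = [write("data", 1), write("flag", 1)]    # original order
reordered = [write("flag", 1), write("data", 1)]   # "harmless" sequential reordering
consumer = [read("r1", "flag"), read("r2", "data")]
```

With the original order, a consumer that observes `flag == 1` always observes `data == 1`. After the reordering, the result `(1, 0)` becomes possible, so the transformation is valid only if an analysis can show that no other thread depends on the order of the two writes.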

3  Debugging  and  Data  Visualization  Tools 

After  a  program  has  been  written  down  as  source  code  and  successfully  compiled  into  object 
code, the behavior of the program for input datasets must be examined. Referring to Figure 1,
the  code  has  moved  from  Phase  3  to  Phase  4.  Two  activities  take  place  at  this  stage.  First,  the 
essential  correctness  of  the  source  code  must  be  verified,  and  second  the  essential  correctness  of 
the  algorithm  must  be  verified. 

The first activity finds and eliminates errors that occurred when the physical problem was
mapped  into  source  code  (Phase  1  to  Phase  2).  Realistically,  it  sometimes  also  uncovers  errors  that 
crept  in  between  Phase  2  and  Phase  3  because  of  errors  in  the  compiler  software.  This  process 
requires  interactive  debugging  tools. 

The  second  activity  finds  errors  at  a  higher  level  of  abstraction.  Perhaps  the  algorithms  that 
the  programmer  chose  in  Phase  2  do  not  faithfully  reflect  the  underlying  physical  process.  Perhaps 
some valid, but unexpected, result has occurred and the underlying cause must be uncovered. These
tasks require a flexible facility for visualizing the program’s data interactively, so that the user has
the ability at runtime to peruse the program data in whatever fashion his or her investigations may
dictate. A suite of powerful visualization methods is required. Further, it should be possible for
the user to modify the state of the program at runtime, perhaps by adding some additional code to
be executed at key places, in order to experiment with new or modified algorithms. These activities
require interactive data visualization and application steering tools.

3.1  MDB  —  Xylem  Parallel  Debugger  (P.  Emrath) 

For a machine to be useful, users must be able to write and debug programs for the machine.
Debugging parallel programs is harder than debugging sequential programs because parallel programs
tend to be more complicated than their sequential counterparts. Parallel programs can consist
of multiple tasks which run on multiple machines having both private and shared address spaces.
Hence,  the  debugging  of  parallel  programs  requires  additional  capabilities  that  are  not  available 
with  traditional  debuggers  for  sequential  programs. 

MDB is a parallel debugger for Cedar [28] and its operating system Xylem [29] that provides
facilities for examining the data and control structures of a Xylem process. As a prototype, MDB
does  not  provide  all  the  services  one  would  find  in  a  traditional  sequential  debugger,  such  as  the 
ability  to  modify  data  or  to  set  breakpoints.  The  debugger  can  show  the  machine  registers  for 
any  processor,  the  status  of  any  task,  and  memory  locations  within  the  private  address  space  of 
a  task,  or  in  the  shared  address  space.  Even  with  such  limited  capabilities,  it  has  proven  to  be 
a  valuable  tool  for  debugging  actual  programs  on  Cedar.  We  believe  we  have  demonstrated  the 
design  approach  with  very  positive  results.  From  this  point,  it  seems  to  be  merely  an  exercise  to 
add  all  the  functionality,  such  as  breakpoints,  one  has  come  to  expect  in  a  complete  debugger. 

The  debugger  is  available  in  two  versions.  A  reduced  version  of  MDB  allows  for  the  examination 
of  the  memory  and  registers  of  a  process  and  can  display  stack  traces.  The  complete  version  of 
the  debugger  supports  all  the  features  of  the  reduced  version  with  additional  capabilities  such  as 
decoding  (disassembly)  of  instructions  and  using  the  symbol  table  for  symbolic  addressing.  In 
either  version,  MDB  can  be  used  interactively  or  it  will  dump  error  status  when  entered  from  a 
non-interactive program.

The design of MDB differs from that of many interactive debuggers: rather than being a
separate process that controls the target program through the operating system, MDB is a set of
modules that are linked with the target program when creating an executable file. This means that
MDB is not totally isolated from the target, but it has the advantages of being relatively simple to
implement and of being able to examine large amounts of program state very efficiently. Both
advantages stem from the fact that program state is directly accessible without requiring any
operating system services.
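
The in-process design can be loosely illustrated in Python, where an inspection module imported by the target program reads the program's state directly instead of attaching from a separate process. This is an analogy only: MDB is a machine-level debugger for Cedar, and the function names below are invented. The sketch also relies on `sys._getframe`, which is specific to the CPython implementation.

```python
import sys
import traceback
from typing import Dict, List


def stack_trace() -> List[str]:
    """Names of the active functions, outermost first, excluding this helper."""
    frames = traceback.extract_stack()[:-1]   # drop the frame of stack_trace itself
    return [frame.name for frame in frames]


def local_state(depth: int = 0) -> Dict[str, object]:
    """Snapshot of the local variables of the caller `depth` levels above."""
    frame = sys._getframe(depth + 1)          # +1 skips the frame of local_state
    return dict(frame.f_locals)
```

Because these helpers run inside the target, no operating system service is needed to reach its memory, which reflects the efficiency argument made above. The cost is the same as for MDB: a badly misbehaving target can also corrupt the inspector.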

The capabilities of MDB have evolved as it was used to debug itself during development. The
user  interface  was  repeatedly  refined  as  continued  use  suggested  changes  to  make  debugging  easier. 
Feedback  from  the  user  community  has  generally  been  favorable  and  a  significant  number  of  real 
bugs  have  been  easily  and  quickly  found  once  the  faulty  program  was  linked  with  the  complete 
version  of  MDB.  Even  without  the  enhancements  of  being  able  to  change  values  in  a  process  or  to 
set  breakpoints  interactively,  MDB  has  proven  to  have  a  very  practical  design  and  to  be  a  useful  tool 
for  debugging  parallel  programs. 
A detailed description of MDB’s user interface is contained in [30]. An overview of MDB was
presented at the 1991 Supercomputing Debugging Workshop [31].

3.2  Visualization  and  Application  Steering  Environments  (J.  Bruner,  R.  Haber,  A.  Tuchman, 
B. Bliss, D. Jablonowski, D. Hammerslag, G. Cybenko)

Visualization  allows  the  user  of  a  scientific  application  to  apply  his  or  her  innate  visual  pattern 
recognition  skills  to  comprehend  the  application’s  data.  The  results  generated  by  today’s  large 
scientific  and  engineering  codes  often  comprise  large  datasets,  perhaps  several  different  arrays  of 
values  that  must  be  interpreted  together  in  order  to  be  understood.  Even  for  a  single  application, 
the  datasets  to  be  displayed,  and  the  display  method  to  be  used,  may  vary  considerably  from 
one  problem  to  the  next.  For  example,  some  problems  may  call  for  simple  displays  such  as  a 
two-dimensional  color-coded  plot  of  a  scalar  field,  while  in  other  cases  more  sophisticated  displays 
such  as  two-  or  three-dimensional  vector  fields  may  be  required. 

Further,  in  many  cases,  the  user  of  a  scientific  or  engineering  application  is  not  interested  solely 
in  the  answer  to  a  problem.  Often,  the  reasons  behind  the  answer  are  as  interesting  as,  or  even 
more  interesting  than,  the  answer  itself.  When  confronted  with  a  particular  solution,  the  user  may 
wish  to  play  “what  if”  games.  For  example:  “What  if  the  time  step  were  shortened?”  “What  if  the 
electric  field  strength  were  increased?” 

Thus, the user of a scientific application needs a visualization system that provides two additional
capabilities. The first capability is “runtime visualization”: the ability to select datasets and their
display methods interactively at runtime. This requires a suite of powerful display methods, together
with a mechanism for constructing datasets at runtime. The second capability is “application
steering”: an interactive means to modify the state of the program at runtime. This could include
adding some additional code to the program to be executed at key places, in order to experiment
with new or modified algorithms.

Further,  the  visualization  system  should  be  able  to  operate  in  a  distributed  environment.  In  a 
typical  case,  an  engineering  design  code  running  on  a  large  computational  engine  might  be  linked  to 
a  visualization  program  running  on  a  rendering  engine,  and  the  entire  system  would  be  configured 
and controlled from a workstation. Such a distributed nature would allow the user to draw upon the
strengths  of  different  platforms  (computation,  rendering,  user  interaction)  in  a  single  environment. 

CSRD  developed  two  generations  of  runtime  visualization  and  application  steering  software. 
Experience  gained  with  the  first  system,  Vista,  was  incorporated  in  the  design  of  the  second,  VASE. 
The  first  version  of  the  VASE  system  was  completed  in  November  1992,  and  we  are  currently 
planning  future  extensions  to  the  project. 

3.2.1  Vista 

Vista  is  an  architecture  and  a  framework  for  distributed  run-time  visualization  of  data.  It 
provides  a  window  into  an  application  by  showing  program  data  automatically  during  execution. 
The  system  architecture  is  designed  for  a  distributed  or  remotely  executing  application;  however, 
the  Vista  model  allows  a  data  or  trace  file  to  replace  the  executing  application,  providing  a 
visualization  “data  browser”  for  existing  data  or  simulation  runs.  The  data  to  be  displayed  and 
the  type  of  display  to  be  used  are  chosen  interactively  while  the  application  is  executing.  It  is 
not  necessary  to  specify  the  data  or  graphics  technique  before  compilation  as  with  conventional 
graphics tools. With minimal instrumentation, an application run in the Vista environment will
have  its  data  (variables  and  data  structures)  made  available  to  a  visualization  system  on  a  remote 
workstation.  Any  data  display  can  be  enabled  or  disabled  at  any  time.  The  application  may  execute 
locally,  on  a  remote  supercomputer,  on  several  clusters  of  a  shared  memory  computer,  or  even 
across  a  network  of  distributed  computers. 

Vista’s  major  user-visible  modules  are  the  Visualization  Manager  (VM)  and  the  Application 
Executive  (AE)[32].  The  VM  is  the  user  interface  and  display  manager.  It  runs  as  a  separate 
process  from  the  application.  The  AE  is  co-resident  with  the  application  program;  therefore,  it  runs 
in  the  context  of  the  application  and  has  access  to  its  entire  address  space.  The  application  need 
not  run  on  the  same  machine  as  the  VM.  Other  modules,  which  are  not  directly  visible  to  the  user, 
provide  data  transfer  and  synchronization  between  the  VM  and  the  AE. 

To  use  Vista,  the  programmer  prepares  the  source  code  by  defining  vis-points,  visualization 
breakpoints.  These  represent  significant  places  within  the  program  where  graphical  displays  should 
be  updated  or  where  the  user  may  wish  to  interact  with  the  code.  The  programmer  defines  vis-points 
by inserting calls to the Vista routine visbreak.

The  prepared  source  code  is  then  compiled  and  linked  with  the  Vista  runtime  library  (which 
includes  the  AE).  To  execute  the  application,  the  user  first  starts  the  VM  (typically  on  the  local 
workstation)  and  then  starts  the  application  (typically  on  a  remote  supercomputer).  When  the 
application  reaches  the  first  call  to  visbreak  it  will  initialize  the  AE  and  connect  to  the  VM.  As 
part  of  the  initialization,  the  AE  will  read  the  symbol  table  for  the  application  program,  thereby 
obtaining  information  about  all  of  the  variables  in  the  program. 

At  this  time  the  user  may  ask  for  and  display  graphically  any  variables  in  the  program.  The 
form  in  which  the  variables  are  displayed  is  determined  by  the  user’s  choice  of  display  method. 
Vista provides a basic set of display methods. In addition, external display methods may be used.
When  an  external  display  method  is  employed,  the  VM  passes  the  appropriate  data  to  the  display 
method  through  a  communication  channel.  This  capability  allows  Vista  to  employ  such  powerful 
visualization  systems  as  AVS[33]  or  ApE[34]. 

An initial version of Vista was completed and distributed. However, work on the Vista project
was suspended when development of VASE began. VASE borrowed from Vista, both in its design
concepts and its implementation (e.g., VASE employs a derivative of the Application Executive
in Vista). The Vista project culminated with the presentation of two papers in 1991, one at the
Fifth SIAM Conference on Parallel Processing for Scientific Computing[35] and the second at
Visualization ’91[36].

3.2.2 VASE

Overview

Work  on  VASE  began  at  CSRD  in  1991,  and  version  1.0  was  completed  in  November  1992. 
The  VASE  project  grew  out  of  previous  experience  in  high-performance  distributed  visualization, 
including  the  RIVERS  project  [37,38]  at  the  National  Center  for  Supercomputing  Applications 
as  well  as  the  Vista  project  (Section  3.2.1).  VASE  also  relies  upon  the  program  analysis  and 
restructuring  tools  developed  as  part  of  the  Sigma  project  (Section  2.1). 

VASE  is  a  collection  of  programming  tools  and  system  software  that  add  runtime  visualization 
and  application  steering  capabilities  to  application  codes.  VASE  supports  the  activities  of  three 
different classes of users during an application’s lifetime: the code developer, the distributed
application configurer, and the end user. At each stage, VASE collects and organizes information
about  the  program,  which  it  then  communicates  to  users  in  later  stages. 

VASE  employs  both  control-flow  and  data-flow  paradigms.  Visualization  systems  typically  are 
based  upon  the  data-flow  model,  in  which  data  passes  from  one  functional  module  to  another.  Each 
module  waits  until  sufficient  data  is  present  at  its  inputs,  at  which  time  it  “fires,”  producing  output 
data that is transmitted to other modules downstream. This paradigm is attractive because it
naturally accommodates asynchronous distributed computing, and it more easily allows reconfiguration
of  a  set  of  functional  modules  (e.g.,  to  substitute  one  display  method  for  another);  therefore,  VASE 
uses  the  data-flow  paradigm  to  manipulate  the  interactions  between  a  set  of  (possibly  distributed) 
processes. 
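
The firing rule described above can be illustrated with a minimal Python sketch. The class and method names (`Module`, `connect`, `deliver`) are hypothetical and are not taken from VASE; the sketch only shows a module that waits until every input holds data, fires, and passes its output downstream.

```python
from typing import Callable, Dict, List, Optional, Tuple


class Module:
    """A data-flow node that fires once all of its named inputs hold data."""

    def __init__(self, name: str, inputs: List[str], func: Callable[..., object]):
        self.name = name
        self.func = func
        # One slot per input; None means "no data has arrived yet".
        self.slots: Dict[str, Optional[object]] = {i: None for i in inputs}
        # Downstream targets as (module, input name) pairs.
        self.downstream: List[Tuple["Module", str]] = []
        # Record of every output produced, kept for inspection.
        self.fired: List[object] = []

    def connect(self, target: "Module", input_name: str) -> None:
        """Route this module's output to one input of another module."""
        self.downstream.append((target, input_name))

    def deliver(self, input_name: str, value: object) -> None:
        """Place a value on an input; fire if every input is now present."""
        self.slots[input_name] = value
        if all(v is not None for v in self.slots.values()):
            self._fire()

    def _fire(self) -> None:
        args = dict(self.slots)
        # Consume the inputs so the module waits for fresh data next time.
        for key in self.slots:
            self.slots[key] = None
        result = self.func(**args)
        self.fired.append(result)
        for target, input_name in self.downstream:
            target.deliver(input_name, result)


# A two-input "scale" module feeding a one-input "display" module.
scale = Module("scale", ["field", "factor"],
               lambda field, factor: [x * factor for x in field])
display = Module("display", ["data"], lambda data: f"plot of {data}")
scale.connect(display, "data")

scale.deliver("field", [1.0, 2.0, 3.0])  # only one input present: nothing fires
scale.deliver("factor", 2.0)             # all inputs present: scale fires, then display
```

Substituting one display method for another corresponds, in this sketch, to connecting `scale` to a different downstream module; the upstream code is unchanged.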

VASE communications are presently based upon DTM[39]. The endpoints are called ports.
Ports are unidirectional, and can be associated with an entire process or only with a single breakpoint
within that process. A VASE connection is a communication link that connects an output port to
an input port. Reads and writes on ports can be defined as either blocking or non-blocking.
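
As an illustration of unidirectional ports with blocking and non-blocking access, the following sketch builds a connection on Python's standard `queue.Queue`. It does not use DTM, and the class names (`Connection`, `OutputPort`, `InputPort`) are assumptions made for the example rather than VASE interfaces.

```python
import queue
from typing import Optional, Tuple


class Connection:
    """A one-way link between an output port and an input port."""

    def __init__(self, capacity: int = 0):
        # A capacity of 0 gives an unbounded queue.
        self._queue = queue.Queue(maxsize=capacity)


class OutputPort:
    """Write-only end of a connection."""

    def __init__(self, connection: Connection):
        self._queue = connection._queue

    def write(self, value: object, blocking: bool = True,
              timeout: Optional[float] = None) -> bool:
        """Return True if the value was queued, False if a non-blocking
        (or timed-out) write found the connection full."""
        try:
            self._queue.put(value, block=blocking, timeout=timeout)
            return True
        except queue.Full:
            return False


class InputPort:
    """Read-only end of a connection."""

    def __init__(self, connection: Connection):
        self._queue = connection._queue

    def read(self, blocking: bool = True,
             timeout: Optional[float] = None) -> Tuple[bool, object]:
        """Return (True, value), or (False, None) if a non-blocking
        (or timed-out) read found no data."""
        try:
            return True, self._queue.get(block=blocking, timeout=timeout)
        except queue.Empty:
            return False, None


conn = Connection(capacity=1)
out_port, in_port = OutputPort(conn), InputPort(conn)

first_write = out_port.write("frame 1", blocking=False)   # accepted
second_write = out_port.write("frame 2", blocking=False)  # refused: connection full
first_read = in_port.read(blocking=False)                 # delivers "frame 1"
second_read = in_port.read(blocking=False)                # nothing left to read
```

A non-blocking write lets a simulation continue when the visualization process falls behind, at the cost of dropped frames; a blocking write keeps the two processes in step.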

Although the data-flow model is well-suited to the interprocess communication used in
distributed visualization, it is cumbersome when used to express the complex logic typically found in
large-scale scientific applications. It also does not reflect the thought process of a developer who
writes code in Fortran. By contrast, a control-flow model can readily represent an application’s
internal structure. The structure of the code can be defined by a control-flow graph, where the nodes
in the graph represent functional blocks and the arcs are transitions between those blocks. VASE
uses  the  control-flow  model  to  describe  the  internal  structure  of  processes  and  identify  appropriate 
locations  for  visualization  and  steering  operations. 

Building  a  VASE  Application 

The  code  developer  writes  the  source  code.  Using  VASE  tools,  he  or  she  defines  major 
functional  blocks  in  the  program;  steering  breakpoints,  which  are  intermediate  points  at  which 
interaction  with  the  data  in  the  application  is  meaningful;  and,  for  each  steering  breakpoint,  the 
variables  that  can  be  read  or  written. 

Figure  2  shows  how  a  steerable  application  is  built  using  VASE  tools.  The  code  developer 
inserts directives into the source code to identify major functional blocks in the program. (These
directives, which take the form of “pseudo comments,” are the only changes that must be made to
the original source code.) The VASE program graph builder, which is based upon Sigma, reads the
annotated  source  code  and  constructs  a  hierarchical  control-flow  graph. 

A  second  VASE  tool,  the  breakpoint  insertion  tool,  displays  this  graph  visually.  The  developer 
may  peruse  the  graph,  expanding  and  collapsing  hierarchically-nested  blocks.  By  clicking  upon 
the  arcs  between  blocks,  the  developer  can  insert  steering  breakpoints  —  places  in  the  program 
where  steering  operations  can  take  place.  (Steering  breakpoints  in  VASE  are  roughly  analogous 
to  vis-points  in  Vista.)  For  each  visualization  breakpoint,  the  developer  selects  the  variables  from 
that  portion  of  the  program  that  should  be  made  available  for  read  access  (e.g.,  visualization)  and 
write  access  (steering). 
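
The information attached to a breakpoint (the arc on which it sits, plus the variables exposed for reading and writing) can be pictured with a small data structure. This is a sketch under stated assumptions: the names `SteeringBreakpoint`, `arc`, `readable`, and `writable` are hypothetical, and the program state is modeled as a plain dictionary rather than the application's address space.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Tuple


@dataclass(frozen=True)
class SteeringBreakpoint:
    """A breakpoint placed on an arc between two functional blocks."""
    arc: Tuple[str, str]        # (source block, destination block)
    readable: FrozenSet[str]    # variables exposed for visualization
    writable: FrozenSet[str]    # variables exposed for steering

    def read(self, state: Dict[str, object], name: str) -> object:
        if name not in self.readable:
            raise PermissionError(f"{name} is not readable at {self.arc}")
        return state[name]

    def write(self, state: Dict[str, object], name: str, value: object) -> None:
        if name not in self.writable:
            raise PermissionError(f"{name} is not writable at {self.arc}")
        state[name] = value


# Hypothetical program state and a breakpoint between two blocks.
state = {"u": [0.0, 1.0], "dt": 0.1, "scratch": 42}
bp = SteeringBreakpoint(
    arc=("assemble", "solve"),
    readable=frozenset({"u", "dt"}),
    writable=frozenset({"dt"}),
)

field = bp.read(state, "u")   # visualization: read access
bp.write(state, "dt", 0.05)   # steering: write access
```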

Another Sigma-based tool, the breakpoint code generator, combines the breakpoint information
with the hierarchical control flow graph and generates source code consisting of the original
program augmented with calls to the VASE breakpoint library. This source code is then compiled
and linked with the VASE library to form a steerable executable image.

Figure 2: Building a Steerable Application with VASE

Application  Configuration 

The  distributed  application  configurer  combines  a  set  of  application  programs  and  visualization 
programs  into  a  distributed  system.  Based  upon  information  created  by  the  code  developer,  the 
configurer  defines  communication  channels  between  the  programs  and  defines  how  the  individual 
codes  will  exchange  data. 

Compiled languages promote efficient execution, while interpreted languages encourage
interaction and allow the program logic to be modified at run time. VASE combines the best features
of the two approaches. VASE includes a breakpoint script interpreter, based upon the Application
Executive[32], that is activated when a program encounters a steering breakpoint. The interpreter
parses and executes breakpoint scripts, consisting of actions for inter-process communication and
steering. In effect, the interpreter provides a high-level debugging facility. However, in contrast
with  conventional  debuggers,  the  interpreter  shares  the  application  program’s  address  space,  giving 
it  efficient  access  to  large  blocks  of  data. 

VASE  includes  a  graphical  configuration  and  execution  control  tool.  The  configurer  uses  this 
tool  to  assign  executable  processes  to  host  processors,  to  set  up  communication  links  between 
processes,  and  to  configure  steering  breakpoints.  Breakpoints  may  cause  the  process  to  pause  and 
await  interactive  input,  to  execute  a  script,  to  execute  a  script  and  then  pause,  or  to  have  no  effect 
at  all. 
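
The four breakpoint behaviors listed above can be summarized as a small dispatch routine. The sketch below is illustrative only: the enumeration, the function `reach_breakpoint`, and the use of Python callables in place of VASE breakpoint scripts are all assumptions.

```python
from enum import Enum
from typing import Callable, Dict, List, Optional

State = Dict[str, float]


class BreakpointMode(Enum):
    PAUSE = "pause"
    SCRIPT = "script"
    SCRIPT_THEN_PAUSE = "script then pause"
    NO_EFFECT = "no effect"


def reach_breakpoint(mode: BreakpointMode, state: State,
                     script: Optional[Callable[[State], None]] = None,
                     pause: Optional[Callable[[State], None]] = None) -> List[str]:
    """Apply the configured behavior and return the actions that were taken."""
    actions: List[str] = []
    runs_script = mode in (BreakpointMode.SCRIPT, BreakpointMode.SCRIPT_THEN_PAUSE)
    pauses = mode in (BreakpointMode.PAUSE, BreakpointMode.SCRIPT_THEN_PAUSE)
    if runs_script and script is not None:
        script(state)           # the script may read or modify program state
        actions.append("script")
    if pauses and pause is not None:
        pause(state)            # stand-in for waiting on interactive input
        actions.append("pause")
    return actions


def halve_time_step(state: State) -> None:
    """A steering action: answer 'what if the time step were shortened?'"""
    state["dt"] /= 2.0


paused_with: List[State] = []


def record_pause(state: State) -> None:
    """Stand-in for an interactive pause; records the state it was shown."""
    paused_with.append(dict(state))


state = {"dt": 0.1}
log = reach_breakpoint(BreakpointMode.SCRIPT_THEN_PAUSE, state,
                       halve_time_step, record_pause)
ignored = reach_breakpoint(BreakpointMode.NO_EFFECT, state,
                           halve_time_step, record_pause)
```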

The  configurer  will  typically  define  scripts  for  applications  that  access  variables  of  interest  in 
the application code and write them to VASE output ports. In addition, he or she will define scripts
for the visualization processes to read data from the VASE ports and display it.

Application  Execution 

The  end  user  runs  the  configured  distributed  application.  In  many  cases  he  or  she  will  specify 
an  input  dataset  and  will  observe  the  progress  of  the  application  through  a  set  of  visual  displays 
defined  by  the  configurer.  In  addition,  the  information  created  by  the  code  developer  and  the 
configurer  is  available  to  the  end  user.  The  end  user  can  exploit  this  information  to  change  the 
visualization  technique,  examine  different  data  sets,  and  even  modify  the  behavior  of  the  program. 

Like  the  configurer,  the  end  user  employs  the  VASE  configuration  and  execution  tool.  The  end 
user can modify scripts created by the configurer, can add or delete processes from the configuration,
and can create or delete communication ports and connections.

Future  Plans 

VASE version 1.0, comprising the program graph builder, breakpoint insertion tool, breakpoint
code generator, and configuration and execution tool, was completed in November 1992. A
distributed  supercomputing  application  based  on  VASE  for  shape  optimization  of  structures  of 
variable  topologies  was  created  and  reported  in  [40]  and  also  formed  the  basis  for  a  demonstration 
videotape[41]. 

3.3 Visibility Ordering of Meshed Polyhedra (P. Williams)

A  visibility  ordering  of  a  set  of  objects  from  some  viewpoint  is  an  ordering  such  that  if  object  a 
obstructs object b, then b precedes a in the ordering. Certain visualization techniques, particularly
direct volume rendering based on projective methods, require a visibility ordering of the polyhedral
cells of a mesh so the cells can be rendered using color and opacity compositing. Visibility ordering
the cells of rectilinear meshes (or certain classes of regular meshes based on a decomposition of a
rectilinear mesh) is straightforward[42]. However, for other types of meshes, such as curvilinear
or  unstructured  meshes,  it  is  not  immediately  obvious  how  to  compute  this  ordering. 

We  have  developed  a  simple  and  efficient  algorithm  for  visibility  ordering  the  cells  of  any 
acyclic convex set of meshed convex polyhedra. This algorithm, called the Meshed Polyhedra
Visibility Ordering (MPVO) Algorithm, orders the cells of a mesh in linear time using linear storage.
Preprocessing techniques and/or modifications to the MPVO Algorithm permit nonconvex cells,
nonconvex meshes (meshes with cavities and/or voids), meshes with cycles, and sets of
disconnected meshes to be ordered. The MPVO Algorithm can also be used for domain decomposition
of  finite  element  meshes  for  parallel  processing.  The  data  structures  for  the  MPVO  Algorithm  can 
be  used  to  solve  the  spatial  point  location  problem. 
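
This report does not give the details of the algorithm. The following Python sketch shows only the general idea commonly associated with it: for each face shared by two cells, decide from the viewpoint which cell lies behind the other, then topologically sort the resulting relation. It should be read as an illustration under stated assumptions, not as the MPVO implementation. The face representation (a point on the face plus a normal pointing from the first cell toward the second) and the function name `visibility_order` are hypothetical.

```python
from collections import deque
from typing import Dict, List, Sequence, Tuple

Vec = Tuple[float, float, float]
# (cell_a, cell_b, point on the shared face, face normal pointing from a toward b)
SharedFace = Tuple[int, int, Vec, Vec]


def dot(u: Vec, v: Vec) -> float:
    return u[0] * v[0] + u[1] * v[1] + u[2] * v[2]


def visibility_order(num_cells: int, faces: Sequence[SharedFace],
                     viewpoint: Vec) -> List[int]:
    """Return the cells back to front: obstructed cells precede the cells
    that obstruct them. Runs in time linear in cells plus shared faces."""
    successors: Dict[int, List[int]] = {c: [] for c in range(num_cells)}
    indegree = [0] * num_cells

    for a, b, point, normal in faces:
        to_view = (viewpoint[0] - point[0],
                   viewpoint[1] - point[1],
                   viewpoint[2] - point[2])
        side = dot(normal, to_view)
        if side > 0:        # viewpoint lies on b's side: b may obstruct a
            far, near = a, b
        elif side < 0:      # viewpoint lies on a's side: a may obstruct b
            far, near = b, a
        else:               # face seen edge-on: no constraint
            continue
        successors[far].append(near)
        indegree[near] += 1

    # Kahn's topological sort over the "is behind" relation.
    ready = deque(c for c in range(num_cells) if indegree[c] == 0)
    order: List[int] = []
    while ready:
        cell = ready.popleft()
        order.append(cell)
        for nxt in successors[cell]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)

    if len(order) != num_cells:
        raise ValueError("the obstructs relation has a cycle from this viewpoint")
    return order


# Three cells in a row along the x axis, sharing faces at x = 1 and x = 2.
faces = [
    (0, 1, (1.0, 0.0, 0.0), (1.0, 0.0, 0.0)),
    (1, 2, (2.0, 0.0, 0.0), (1.0, 0.0, 0.0)),
]
from_right = visibility_order(3, faces, (10.0, 0.0, 0.0))
from_left = visibility_order(3, faces, (-5.0, 0.0, 0.0))
from_above_middle = visibility_order(3, faces, (1.5, 5.0, 0.0))
```

The cycle check corresponds to the acyclicity requirement stated above: if the relation contains a cycle, no visibility ordering exists without first splitting cells.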

The  MPVO  Algorithm  was  used  to  generate  the  picture  of  a  scientific  data  set  of  approximately 
70,000  tetrahedra  which  is  shown  on  the  cover  of  [43].  The  basic  ideas  of  the  MPVO  Algorithm 
were  suggested  by  Herbert  Edelsbrunner  in  a  conversation  regarding  his  paper  on  the  acyclicity  of 
cell complexes[44]. A similar algorithm to the MPVO Algorithm, also based on Edelsbrunner’s
suggestions, was developed independently by Max, Hanrahan, and Crawfis[45].

The Binary Space Partition (BSP) tree algorithm[46] is not suitable for visibility ordering large
meshes  because  the  algorithm  uses  splitting  planes  (even  when  not  required  to  break  cycles). 
Because  the  cells  are  meshed,  a  large  number  of  cells  could  be  split,  resulting  in  a  potential 
explosion in the total number of cells. Analysis of the BSP tree algorithm by Paterson and Yao[47]
suggests that its performance could be $O(f^2)$, where $f$ is the number of faces in the original mesh.
An  A-buffer[48]  is  also  not  suitable  for  visibility  ordering  large  meshes  because  there  are  too  many 
transparent  cells  at  each  pixel,  making  memory  requirements  prohibitive  with  current  hardware. 
Goad[49]  describes  a  special  purpose  program  written  in  Lisp  for  visibility  ordering  polygons. 
This  approach  might  be  adapted  for  polyhedra;  further  investigation  may  be  warranted.  The  brute 
force  algorithm,  for  visibility  ordering  an  acyclic  mesh,  calculates  an  obstructs  relation  for  every 
pair of the $n$ cells in the mesh, and requires at least $O(n^2)$ cycles.

Using  the  MPVO  algorithm  as  a  basis,  we  developed  a  parallel  (MIMD)  volume  rendering 
system for scalar data defined on irregular or curvilinear meshes. In the process, the MPVO
algorithm was parallelized. The particular platform utilized was a Silicon Graphics VGX 4D/360VGX
with  6  CPUs.  The  focus  of  the  system  is  to  visualize  3D  finite  element  and  scattered  data.  (When 
scattered  data  is  triangulated  it  results  in  an  irregular  data  set.) 

The volume renderer, based on cell projection rather than ray tracing, utilizes a hierarchy
of rendering approximations, graphics hardware, preprocessing and filtering techniques[50,51].
Using  these  techniques,  volumetrically  rendered  images  of  curvilinear  and  irregular  data  sets  with 
over  1,000,000  cells  were  generated  in  15  seconds  (without  filtration)[50].  Using  filtering,  similar 
performance is possible for even larger data sets. Data sets with up to 100,000 cells were rendered
in  less  than  2  seconds. 

In addition, a volume density optical model was developed[52]. This model was developed as
a  theoretical  basis  for  volume  rendering  finite  element  data  as  opposed  to  scanned  data  sets  where 
material  classification  is  required. 

The National Center for Supercomputing Applications is currently productizing the renderer
(constructing  a  user-friendly  interface  for  it)  so  it  can  be  made  available  to  scientists. 

4  Performance  Instrumentation  and  Visualization  Tools 

Once  a  large  code  is  running  successfully  on  a  target  machine,  the  focus  of  a  developer’s 
attention  moves  from  obtaining  correct  results  to  improving  performance.  Referring  to  Figure  1, 
the  attention  shifts  from  Phase  4  to  Phase  5. 

Supercomputer codes, by their very nature, are intensive consumers of computational resources.
Therefore,  it  is  essential  that  they  run  as  efficiently  as  possible.  However,  it  is  also  the  nature  of 
supercomputer  codes  to  be  large  and  complex.  As  a  result,  the  accumulation  of  years  of  incremental 
improvements may produce a lengthy code in which no one person is expert.

Users  commonly  employ  profiling  tools  to  assist  in  this  task.  A  profiling  tool  can  provide 
high-level,  general  statistical  information  on  which  code  segments  consume  the  most  time.  A 
major  advantage  of  most  profiling  tools  is  that  they  are  relatively  automatic,  requiring  little  user 
intervention.  They  are  also  available  for  a  wide  variety  of  machines. 

Unfortunately,  most  profiling  tools  are  limited  to  providing  a  list  of  the  most  well-optimized 
subroutines.  For  many  parallel  machines  or  vector  supercomputers,  the  most  interesting  data  falls 
at  the  loop  level  and  is  invisible  to  the  profiler. 

Another problem with existing profilers is that they are machine-specific. In order to make
meaningful  direct  comparisons  between  machines,  users  need  results  that  are  commensurable. 
This  is  possibly  the  most  outstanding  weakness  of  existing  performance  evaluation  tools.  Table  1 
shows the fifteen most time-consuming routines as measured by GPROF on an Alliant FX/80, an
Alliant  FX/2800,  and  a  Sun  Sparcstation-1.  It  is  difficult  to  recognize  from  those  results  that 
the  same  code  is  being  run  on  the  three  machines.  The  user  is  required  to  divine  the  purpose  of 
implementation-specific routines with cryptic names such as mcount and _unlock_wo_prof,
which are not routines in his or her code. Furthermore, users want the time spent in library functions
(_dexp_fortran_val_, etc.) attributed to the routines using those functions, not to the library
routines.  While  performance  data  for  system  functions  like  mcount  does  provide  information, 
most  users  want  information  only  about  the  routines  in  their  code  over  which  they  exercise  control. 

Table 1: GPROF Results of One Code on Three Machines

Many  widely-used  profilers  (e.g.,  the  GPROF  utility  on  many  Unix  systems)  collect  data  by 
sampling  the  program  counter  at  regular  intervals.  The  resulting  data  represents  the  average 
amount  of  time  spent  in  each  portion  of  the  code  during  the  sampling  interval,  but  much  detail  is 
lost.  If  a  routine  performs  identical  work  on  each  invocation  but  these  invocations  have  drastically 
differing  execution  times,  this  information  can  indicate  important  effects  such  as  paging  from  virtual 
memory or cache usage. Sampling, which averages over the entire execution of the program, misses
these  differences.  Further,  sampling  cannot  easily  account  for  varying  amounts  of  work  performed 
in  different  invocations.  Consider  the  program  shown  in  Figure  3.  This  program  was  executed  and 
profiled  on  an  RS/6000  with  the  sampling-based  tool  GPROF,  and  the  resulting  performance  data  is 
shown  in  Figure  4. 

Figure  4  indicates  that  100%  of  the  time  was  spent  in  C,  which  was  called  200  times.  Further, 
it  states  that  C  was  called  100  times  from  A  and  spent  1.27  seconds  when  C  was  invoked  from  A 
(second  row).  Similarly,  the  profile  states  that  C  spent  1.27  seconds  when  C  was  invoked  from  B. 

C     Main program: calls A and B ten times each.
      DO 100 I = 1, 10
         CALL A
         CALL B
  100 CONTINUE
      STOP
      END

C     A calls C ten times with a small argument.
      SUBROUTINE A
      DO 20 I = 1, 10
         CALL C(10)
   20 CONTINUE
      RETURN
      END

C     B calls C ten times with a large argument.
      SUBROUTINE B
      DO 30 I = 1, 10
         CALL C(100)
   30 CONTINUE
      RETURN
      END

C     C executes K*K iterations of an empty loop nest.
      SUBROUTINE C(K)
      DO 300 J = 1, K
         DO 100 I = 1, K
  100    CONTINUE
  300 CONTINUE
      RETURN
      END


Figure  3:  An  Example  Program 
Figure 4: Dynamic Call Graph for the Example Program

(The rest of the figure describes statistics for the other sections of the program.) However, it is
clear  from  inspection  of  the  source  code  that  this  sampling-based  analysis  is  incorrect;  most  of  the 
time  is  being  spent  in  subroutine  C  when  C  is  being  called  from  B. 
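
The discrepancy can be checked with a short calculation. The Python sketch below replays the calling pattern of the example program and compares two ways of dividing the time spent in C between its callers: in proportion to call counts (which is what the reported 1.27 seconds per caller amounts to) and in proportion to the work actually requested. It assumes that the cost of C(K) is proportional to its K*K loop iterations, and takes the total of 2.54 seconds from the two 1.27-second figures quoted above; both are simplifications made for the illustration.

```python
from collections import Counter

# Replay the calling pattern: (caller, argument K passed to C).
calls = []
for _ in range(10):                  # main program: DO 100 I = 1, 10
    calls += [("A", 10)] * 10        # A calls C(10) ten times
    calls += [("B", 100)] * 10       # B calls C(100) ten times

TOTAL_SECONDS = 2.54                 # 1.27 s + 1.27 s reported for C

# Number of calls to C made by each caller.
call_counts = Counter(caller for caller, _ in calls)

# Work requested by each caller, assuming cost proportional to K*K iterations.
work = Counter()
for caller, k in calls:
    work[caller] += k * k

total_calls = sum(call_counts.values())
total_work = sum(work.values())

# Time attributed per caller under each rule.
by_calls = {c: TOTAL_SECONDS * n / total_calls for c, n in call_counts.items()}
by_work = {c: TOTAL_SECONDS * w / total_work for c, w in work.items()}

for caller in ("A", "B"):
    print(f"C called from {caller}: {call_counts[caller]} calls, "
          f"{by_calls[caller]:.3f} s by call count, "
          f"{by_work[caller]:.3f} s by work")
```

Under this assumption, calls from B account for roughly 99% of the iterations executed in C, even though A and B each make 100 calls.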

For  these  reasons,  we  have  developed  a  set  of  performance  tools  based  upon  source-level 
instrumentation and trace collection. Section 4.1.1 describes TraceView, a tool for examining these
traces. Section 4.1.2 describes the tools CPROF and SuperVu, which employ structured traces to
provide  both  summary  profiling  and  trace  perusal  capabilities. 

Measurements  of  a  code  on  a  machine  provide  an  accurate  view  of  the  performance  currently 
being  obtained,  as  well  as  the  areas  where  the  code  performs  well  or  fails  to  perform  well.  However, 
when  faced  with  the  task  of  improving  a  code’s  performance,  a  designer  often  needs  to  know  not 
only  the  achieved  performance  but  also  the  potential  for  performance  improvement.  In  effect,  he 
or  she  needs  a  tool  that  can  give  some  quantitative  assessment  of  what  the  performance  could  be. 
Section  4.2  describes  how  we  modified  the  MaxPar  performance  simulator[53],  written  at  CSRD, 
to  fill  this  role. 

4.1 Trace-Based Performance Tools (J. Bruner, G. Cybenko, D. Hammerslag, D. Jablonowski,
A.  Malony,  S.  Sharma) 

4.1.1 TraceView

TraceView is an event-trace display and analysis tool. It supports the viewing of timeline
displays  of  event  occurrences  together  with  performance  metrics. 

The TraceView architecture is based on the concept of a trace visualization session. A session
consists of files, views, and displays. The topmost level of a session specifies a set of trace files
to visualize. For each file, a set of views can be constructed. A view is a template that defines a
region of the trace, including the beginning location, the ending location, and event filtering. Then,
for each view, a set of displays can be created. Although views and displays can only be defined
within a single trace file (therefore disallowing displays that combine data from multiple trace
files), multiple trace files can be displayed simultaneously in the same session.

TraceView operates upon trace files that contain a series of event records, where an event
represents a transition from one state to another. (TraceView assigns no meaning to the states
themselves. Typically, the user will instrument an application code with subroutine calls to a
trace-generation library. Each such call will use a unique identifier. In this case the trace file
describes  the  movement  of  the  program  between  the  trace  points.)  In  addition  to  encoding  the 
current  and  previous  state,  the  event  record  contains  an  event  type  and  a  time  stamp.  Events  may 
also  include  other  data;  for  instance,  on  the  Cray  Y-MP  event  records  may  contain  data  from  the 
hardware  performance  monitors. 
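
To make the event-record and view concepts concrete, the sketch below models a record with the four fields named above plus optional extra data, selects a view by region and event filter, and derives the time spent in each state. The field names, the in-memory list representation, and the helper functions are hypothetical; the actual TraceView file format is not described here.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional, Set


@dataclass
class EventRecord:
    timestamp: float
    event_type: int
    previous_state: int
    current_state: int
    # Optional machine-specific data, e.g. hardware monitor readings.
    extra: Dict[str, float] = field(default_factory=dict)


def select_view(trace: Iterable[EventRecord], begin: float, end: float,
                event_types: Optional[Set[int]] = None) -> List[EventRecord]:
    """Return the events inside [begin, end] that pass the event filter."""
    return [e for e in trace
            if begin <= e.timestamp <= end
            and (event_types is None or e.event_type in event_types)]


def time_per_state(trace: List[EventRecord]) -> Dict[int, float]:
    """Sum the time spent in each state between consecutive events."""
    totals: Dict[int, float] = {}
    for current, following in zip(trace, trace[1:]):
        elapsed = following.timestamp - current.timestamp
        totals[current.current_state] = totals.get(current.current_state, 0.0) + elapsed
    return totals


trace = [
    EventRecord(0.0, 1, previous_state=0, current_state=1),
    EventRecord(2.0, 1, previous_state=1, current_state=2),
    EventRecord(5.0, 1, previous_state=2, current_state=1),
    EventRecord(6.0, 2, previous_state=1, current_state=0, extra={"flops": 1.5e6}),
]

middle = select_view(trace, begin=1.0, end=5.5)
type_two = select_view(trace, begin=0.0, end=10.0, event_types={2})
durations = time_per_state(trace)
```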

TraceView displays events as a set of timelines that show the event transitions and the associated
other information (e.g., HPM performance metrics). The display methods and user interface are
based upon X11 and Motif. TraceView supports many user interface functions, including the ability
to  save  session  configurations  between  invocations. 

TraceView’s capabilities and user interface are described in detail in [54].

4.1.2  CPROF  and  SuperVu 

CPROF is a trace-based profiler for serial and shared-memory parallel computers. An automatic
instrumentation tool augments the application source code with calls to a tracing library. This
approach is similar to Cray Research’s FlowTrace[55]; however, unlike FlowTrace, CPROF employs
source-level  instrumentation  and  is  therefore  widely  portable. 

The  tracing  routines  reflect  the  application’s  structure.  The  execution  of  the  application  is 
divided  into  regions  delimited  by  a  pair  of  instrumentation  points  where  a  routine  in  the  tracing 
library is invoked. An automatic instrumentation tool, which is based on the Cedar Fortran[56]
preprocessor, marks subroutines and loop nests in Fortran programs automatically. Users may
manually insert additional instrumentation points to provide additional detail in some areas of the
program. 

The instrumented program is linked with the tracing library and executed. At link time, the user
may  choose  a  tracing  library  that  employs  special  hardware  (such  as  the  tracing  hardware  available 
for  Cedar[28,57]),  or  a  portable  software-based  tracing  library  that  does  not  require  any  special 
hardware. 

As  the  program  runs  it  produces  a  set  of  trace  records.  Each  trace  record  includes  the  identity 
of  the  trace  point  and  other  relevant  performance  information,  such  as  elapsed  or  CPU  time.  In 
addition,  the  structure  of  traces  allows  for  machine-specific  extensions  on  platforms  that  provide 
additional  information  (such  as  directly-accessible  hardware  performance  counters). 

CPROF is an analysis tool that reads traces and prints summary statistics. It profiles the
application using wall-clock time and computes statistics based upon the exclusive timings of
code regions (the time spent in each region minus the time spent in its children). CPROF provides
information  on 

•  the amount of time the machine spent executing code in the named code segment (subroutine,
basic block, loop),

•  the number of times the code segment (subroutine, basic block, loop) was executed,

•  statistical variations in the timing of the code segment: the minimum, maximum, average,
and standard deviation of the times, and the instances where these times occurred,

•  a detailed execution profile of each code segment for performance tuning,

•  a dynamic calling tree of all the code segments, and

•  input data for further theoretical performance analysis by other tools.
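
The exclusive-time computation behind these statistics can be sketched as follows. The example assumes a simplified trace of properly nested (time, kind, region) tuples on a single process, which is not the actual trace format; the function name `profile` and the region names are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict, List, Tuple

Event = Tuple[float, str, str]  # (wall-clock time, "enter" or "exit", region name)


def profile(events: List[Event]) -> Dict[str, Dict[str, float]]:
    """Compute per-region statistics on exclusive time: the time spent in a
    region minus the time spent in the regions it invoked."""
    stack: List[list] = []  # each entry: [region, entry time, time in children]
    samples: Dict[str, List[float]] = defaultdict(list)

    for time, kind, region in events:
        if kind == "enter":
            stack.append([region, time, 0.0])
        elif kind == "exit":
            name, start, child_time = stack.pop()
            if name != region:
                raise ValueError(f"unbalanced trace: exit {region} inside {name}")
            inclusive = time - start
            samples[name].append(inclusive - child_time)
            if stack:
                # Charge this region's inclusive time to its parent's children.
                stack[-1][2] += inclusive
        else:
            raise ValueError(f"unknown event kind: {kind}")

    return {
        name: {
            "count": len(times),
            "total": sum(times),
            "min": min(times),
            "max": max(times),
            "mean": mean(times),
            "stdev": pstdev(times),
        }
        for name, times in samples.items()
    }


events = [
    (0.0, "enter", "main"),
    (1.0, "enter", "solve"),
    (4.0, "exit", "solve"),    # first execution of solve: 3.0 s
    (5.0, "enter", "solve"),
    (6.0, "exit", "solve"),    # second execution of solve: 1.0 s
    (10.0, "exit", "main"),    # main: 10.0 s inclusive, 6.0 s exclusive
]
stats = profile(events)
```

Because each execution of a region contributes its own sample, two invocations that perform identical work but differ sharply in time remain visible in the minimum, maximum, and standard deviation, which an averaged sampling profile would hide.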

SuperVu is an interactive performance visualization tool. SuperVu reads the same event
trace files that CPROF reads, but unlike CPROF it allows the user to browse through the data
interactively. SuperVu runs under X11/Motif, and we have ported it to multiple platforms including
Sparcstations, Sun 3’s and the Alliant FX/2800.

SuperVu provides a different set of displays than TraceView (Section 4.1.1). While
TraceView does not associate any particular meaning to events, SuperVu’s analysis is based upon the
knowledge  that  events  come  in  entry-exit  pairs.  Thus,  SuperVu  is  able  to  refine  the  raw  event  data 
to  display  derived  values  such  as  the  amount  of  time  spent  within  each  code  region,  the  number  of 
times  each  region  was  executed,  etc. 

The  SuperVu  user  may  choose  from  a  variety  of  performance  data  displays  for  each  process 
in  the  application.  Their  diversity  allows  the  user  to  analyze  the  application  at  multiple  levels.  The 
displays include flat profiles, detailed profiles, dynamic calling trees, and overall process behavior.
These  are  displayed  in  a  visually  informative  manner. 

SuperVu  is  able  to  process  information  about  the  behavior  of  the  computer  system  while  the 
application  is  running.  SuperVu  treats  the  context  switches  as  a  code  segment  within  the  program 
and treats this code segment the same as other code segments. Among other things, it can subtract
the time spent “switched out” from the execution time of the program.

A more detailed description of CPROF and SuperVu, together with a case study for ARC2D,
one of the Perfect Benchmarks®, has been submitted for publication [58]. Additional information
about  the  tools  can  also  be  found  in  [59]  and  [60]. 

4.2 MaxPar (S. Ho)

Overview 

The MaxPar performance simulator [53,61], written at CSRD, employs execution-driven
simulation to estimate parallel performance. It was inspired by Fortran parallelism estimation
techniques [62], and their first implementation, as found in COMET [63].

MaxPar’s  input  is  a  sequential  Fortran  77  program.  MaxPar  defines  simulated  time  in  programs 
in  terms  of  an  abstract  machine,  where  operations  require  a  predefined  amount  of  time.  The  abstract 
machine may be defined with a finite or infinite number of processors. For each variable in the original
program,  MaxPar  creates  shadow  variables.  The  read  shadow  and  write  shadow  represent  the  last 
point  in  simulated  time  at  which  the  associated  variable  was  read  or  written,  respectively.  MaxPar 
then  identifies  all  of  the  operations  upon  variables  (e.g.,  add,  multiply)  and  augments  the  program 
with  code  that  computes  the  new  values  for  the  associated  shadows.  By  scheduling  operations 
according  to  the  shadow  values  that  it  computes  (effectively  computing  all  data  dependences  at 
runtime),  MaxPar  is  able  to  compute  a  histogram  of  the  available  parallelism  for  each  simulated 
time  unit.  It  also  can  generate  a  trace  file  containing  all  of  the  operations  as  it  has  scheduled  them. 
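
The shadow-variable scheme can be sketched in a few lines. The following is an illustration only, under several assumptions that the text does not confirm: the operation latencies in `LATENCY` are invented, the machine has an unlimited number of processors, an operand counts as "read" until the operation using it finishes, and anti and output dependences are enforced alongside flow dependences. The names `simulate` and `program` are hypothetical.

```python
from collections import Counter, defaultdict

LATENCY = {"add": 1, "mul": 2}     # hypothetical abstract-machine costs

def simulate(ops):
    """Schedule a sequential operation stream as early as dependences allow.

    `ops` is a list of (opcode, dst, srcs) in sequential program order.
    Assumes an unlimited number of processors.
    """
    read_shadow = defaultdict(int)   # last simulated time each variable was read
    write_shadow = defaultdict(int)  # last simulated time each variable was written
    histogram = Counter()            # time unit -> operations in flight
    schedule = []

    for opcode, dst, srcs in ops:
        start = max(
            [write_shadow[s] for s in srcs]            # flow dependences
            + [read_shadow[dst], write_shadow[dst]]    # anti and output dependences
        )
        finish = start + LATENCY[opcode]
        for s in srcs:
            read_shadow[s] = max(read_shadow[s], finish)
        write_shadow[dst] = finish
        for t in range(start, finish):
            histogram[t] += 1
        schedule.append((start, finish, opcode, dst))
    return schedule, histogram

program = [
    ("add", "a", ["x", "y"]),   # independent of the next operation
    ("mul", "b", ["x", "z"]),
    ("add", "c", ["a", "b"]),   # must wait for both a and b
]
schedule, histogram = simulate(program)
```

Here the first two operations overlap at time 0, and the third cannot start until time 2, when `b` has been written. The histogram therefore records two operations in flight at time 0 and one at each of times 1 and 2, and the list of scheduled operations plays the role of the trace file mentioned above.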

Although  the  primary  development  of  MaxPar,  which  is  ongoing,  was  supported  by  other 
agencies,  we  made  some  modifications  to  MaxPar[64]  to  enable  it  to  be  used  together  with  our 
trace-based  performance  tools  to  identify  performance  bottlenecks. 

Extensions for TraceView

We extended MaxPar to generate trace records in the format used by TraceView (Section 4.1.1).
MaxPar predicts the time when each routine in an application is entered and exited, and writes these
times to the trace file. It can also write its histogram of parallelism in TraceView-compatible
format. These modifications allow side-by-side comparisons of MaxPar’s predicted results and actual
measured results. Such comparisons may show, for example, how well the actual execution
takes advantage of the parallelism reported by MaxPar.

The  Parallel  Performance  Advisor 

MaxPar was also modified to work more closely with the trace-based performance tools CPROF
and SuperVu (Section 4.1.2) to combine measured performance data with predictions. This project,
called the Parallel Performance Advisor (PPA), brought together these two types of analysis to find
the portions of the code with the greatest time consumption and unexploited parallelism. An
overview of this approach was published in [65]. Additional examples are described in [64].

The PPA consists of two major parts. First, the trace-based performance tool CPROF is used
to  create  a  profile  of  the  application’s  actual  execution  time.  Based  upon  this  information,  CPROF 
generates  a  list  of  directives  to  MaxPar  specifying  the  routines  in  each  of  the  most  time-consuming 
subtrees  of  the  call  graph. 
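
The selection step might look roughly like the sketch below. The text does not specify how CPROF chooses subtrees or what the directive format is, so this example simply takes the `k` children of the root with the largest inclusive time and returns the routine names in each; the dictionary layout, the function names, and the routine names and timings in `call_tree` are all hypothetical.

```python
def routines_in(subtree):
    """Return the sorted names of all routines in a call subtree."""
    names = {subtree["name"]}
    for child in subtree["children"]:
        names.update(routines_in(child))
    return sorted(names)

def hottest_subtrees(root, k):
    """Pick the k children of `root` with the largest inclusive time."""
    ranked = sorted(root["children"], key=lambda n: n["inclusive"], reverse=True)
    return [(n["name"], routines_in(n)) for n in ranked[:k]]

# Hypothetical measured call tree (inclusive times in seconds).
call_tree = {
    "name": "main", "inclusive": 100.0, "children": [
        {"name": "step", "inclusive": 80.0, "children": [
            {"name": "flux", "inclusive": 50.0, "children": []},
            {"name": "smooth", "inclusive": 20.0, "children": []},
        ]},
        {"name": "setup", "inclusive": 15.0, "children": [
            {"name": "readin", "inclusive": 10.0, "children": []},
        ]},
        {"name": "output", "inclusive": 5.0, "children": []},
    ],
}
selected = hottest_subtrees(call_tree, 2)
```

Each entry in `selected` names a subtree root and the routines beneath it, which is the information a per-subtree simulator run would need.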

We  modified  MaxPar  to  restrict  its  analysis  to  the  specified  subtree  of  routines,  assuming  each 
invocation  of  the  routines  is  disjoint.  In  general  there  are  a  small  number  of  these  subtrees,  and 
each  is  processed  by  a  separate  invocation  of  MaxPar.  These  MaxPar  processes  are  independent 
from  one  another  and  can  be  run  in  parallel. 

To  collect  data  within  a  subtree  of  routines,  we  extended  MaxPar  to  permit  gathering  data  at 
the  module  level,  instead  of  only  computing  program-level  summaries.  In  addition,  a  barrier  is 
inserted  at  the  entry  and  exit  points  of  the  highest-level  routine  to  be  instrumented. 

MaxPar’s  output  summarizes  the  parallelism  available  within  each  routine  and  its  descendants. 
The interactive tool SuperVu reads MaxPar’s output and combines it with the measured
performance data, which it can then display in the form of a dynamic calling tree. Figure 5 shows an
example for FLO52, one of the Perfect Benchmarks®. Each box depicts performance information
for one routine in the calling tree, giving the name of the routine, the time consumed by the routine
(in seconds), and the number of calls to the routine. In addition, the measured data is augmented
with the MaxPar-computed parallelism (which is shown in parentheses), giving the percentage of
total time attributable to the routine and its average parallelism.

By  examining  the  combined  MaxPar  and  measured  data,  the  PPA  user  can  identify  likely 
candidates for performance improvement. In the example in Figure 5, over 70% of
the total CPU time is consumed by the four routines dflux, eflux, psmoo, and dfluxc, and
they  also  show  good  parallelism  values.  Thus,  a  good  start  would  be  to  parallelize  these  routines 
first. 
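
One way to turn this reasoning into a ranking is sketched below. The scoring rule is an assumption, not the PPA's method: it estimates the time that would be saved if a routine's measured time were divided by its average parallelism. The routine names and figures are invented and are not taken from Figure 5.

```python
def rank_candidates(routines):
    """Order routines by estimated time saved if fully parallelized.

    `routines` holds (name, measured_seconds, average_parallelism) tuples,
    with average_parallelism >= 1. The scoring rule is a heuristic.
    """
    scored = []
    for name, seconds, parallelism in routines:
        saving = seconds * (1.0 - 1.0 / parallelism)
        scored.append((saving, name))
    scored.sort(reverse=True)
    return [name for _, name in scored]

# Hypothetical measurements: (routine, seconds, average parallelism).
measurements = [
    ("alpha", 40.0, 8.0),
    ("beta", 50.0, 1.0),
    ("gamma", 10.0, 100.0),
]
ranking = rank_candidates(measurements)
```

The routine `beta` consumes the most time but has no available parallelism, so it ranks last, while `gamma` has very high parallelism but too little time to outrank `alpha`. This mirrors the point of the paragraph above: neither measurement is sufficient alone.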

MaxPar  also  was  modified  to  perform  its  analysis  only  for  a  specified  interval  of  an  application’s 
total  execution  time.  To  use  this  facility,  the  user  first  runs  MaxPar  on  the  entire  application  to 
get  an  overall  parallelism  histogram  and  identify  the  major  phases  in  the  computation,  and  then 
“zooms  in”  on  intervals  of  interest  to  determine  how  the  performance  is  affected  by  changes  in  the 
parallelism,  synchronization  costs,  etc. 

Figure 5: FLO52 Dynamic Calling Tree from CPROF and MaxPar